Skip to content

Conversation

@ylpoonlg
Copy link
Contributor

@ylpoonlg ylpoonlg commented Feb 2, 2026

Contributes to #94019 .

@dotnet/arm64-contrib @a74nh

@dotnet-policy-service dotnet-policy-service bot added the community-contribution Indicates that the PR has been added by a community member label Feb 2, 2026
@dotnet-policy-service
Copy link
Contributor

Tagging subscribers to this area: @dotnet/area-system-runtime-intrinsics
See info in area-owners.md if you want to be subscribed.

@stephentoub
Copy link
Member

🤖 Copilot Code Review — PR #123890

Holistic Assessment

Motivation: This PR adds SVE2 non-temporal gather load intrinsics (LDNT1B, LDNT1SH, LDNT1SW, LDNT1H, LDNT1W, LDNT1D), contributing to issue #94019. Non-temporal loads are valuable for streaming data access patterns where cache pollution should be avoided. The need for these APIs is justified by the approved API proposal.

Approach: The implementation follows the established pattern for SVE gather loads, adding 12 new intrinsic entries in the JIT tables, corresponding managed APIs in Sve2.cs, and codegen support. The approach of manually emitting LSL instructions to compute byte offsets from element indices is correct since SVE2 LDNT1 instructions (unlike standard LD1 gathers) don't support embedded shift immediates.

Summary: ⚠️ Needs Changes. The implementation has one correctness issue in the codegen and lacks active test coverage. These should be addressed before merge.


Detailed Findings

❌ Codegen correctness — GatherVectorNonTemporal may fail for 4-byte elements

File: src/coreclr/jit/hwintrinsiccodegenarm64.cpp (lines 2362-2366)

The codegen for NI_Sve2_GatherVectorNonTemporal has this logic:

else if (intrin.id == NI_Sve2_GatherVectorNonTemporal)
{
    assert(emitActualTypeSize(intrin.baseType) == 8);
    GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, op3Reg, op3Reg, 3, opt);
}

This asserts that the base type is 8 bytes and always shifts by 3. However, looking at the intrinsic table in hwintrinsiclistarm64sve.h, GatherVectorNonTemporal maps to:

  • INS_sve_ldnt1w for int/uint/float (4-byte elements)
  • INS_sve_ldnt1d for long/ulong/double (8-byte elements)

Verification: The managed APIs in Sve2.cs show that the 4-byte Vector-based overloads (e.g., GatherVectorNonTemporal(Vector<int>, Vector<uint>)) are intentionally commented out per #103297. However, the GatherVectorWithByteOffsetsNonTemporal APIs for 4-byte types (Vector<int>, Vector<float>) are exposed. Since the intrinsic table entry for GatherVectorWithByteOffsetsNonTemporal uses the same instruction (ldnt1w/ldnt1d), this case must be handled correctly.

Looking more carefully at the codegen case list at lines 2334-2345, both NI_Sve2_GatherVectorNonTemporal and NI_Sve2_GatherVectorWithByteOffsetsNonTemporal share the same case block. For the WithByteOffsets variant, no shift is needed (the offsets are already in bytes). However, GatherVectorNonTemporal with indices does need shifting.

Issue: If GatherVectorNonTemporal is called with 4-byte types (int/float) through overloads like GatherVectorNonTemporal(Vector<int>, int*, Vector<uint>) — which is exposed in the managed API — the current codegen will hit the assert and fail, or compute incorrect addresses.

Recommendation: Add handling for the 4-byte case:

else if (intrin.id == NI_Sve2_GatherVectorNonTemporal)
{
    emitAttr baseSize = emitActualTypeSize(intrin.baseType);
    if (baseSize == EA_4BYTE)
    {
        GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, op3Reg, op3Reg, 2, opt);
    }
    else
    {
        assert(baseSize == EA_8BYTE);
        GetEmitter()->emitIns_R_R_I(INS_sve_lsl, emitSize, op3Reg, op3Reg, 3, opt);
    }
}

⚠️ Test coverage — Tests are commented out

File: src/tests/Common/GenerateHWIntrinsicTests/Arm/Sve2Tests.cs

The diff shows test entries being added but they are commented out:

// (Templates.SveGatherVectorVectorBases,new Dictionary<string, string> {["TestName"] = "Sve2_GatherVectorByteZeroExtendNonTemporal_Bases_int_uint", ...

While I understand that the Vector-bases overloads were removed per #103297, there should be active tests for the overloads that are being shipped. Currently there appear to be no active test cases for the new APIs.

Recommendation: Enable test coverage for at least:

  1. The (mask, pointer, offsets) overloads for each intrinsic variant
  2. The (mask, Vector<ulong> addresses) overloads for 64-bit element types
  3. Both 4-byte and 8-byte variants to catch the codegen issue above

⚠️ API consistency — Commented-out overloads citing #103297

Files: Sve2.cs, Sve2.PlatformNotSupported.cs

Multiple overloads are commented out with references to issue #103297:

// Removed as per #103297
// public static unsafe Vector<int> GatherVectorNonTemporal(Vector<int> mask, Vector<uint> addresses) => ...

This is fine if intentional, but note that the original API proposal in #94019 included these overloads. Consider:

  1. Adding a tracking issue or linking to Arm64/Sve: Implement the SVE APIs that takes 32-bit address as input #103297 in the PR description to explain why these overloads are excluded
  2. Ensuring the decision to exclude 32-bit Vector-base overloads is documented for future reference

💡 Suggestion — Consider factoring offset-scaling logic

The new codegen block (lines 2346-2378) duplicates the offset-scaling pattern already present for existing SVE gather intrinsics (lines 2252-2310). Consider consolidating this logic into a shared helper or extending the existing switch to cover both SVE and SVE2 gather variants, reducing future drift risk.


✅ Register ordering — Correct for SVE2 LDNT1x encoding

The comment "op2Reg and op3Reg are swapped" is correct. The SVE2 non-temporal gather instructions use the IF_SVE_IX_4A encoding which expects (Target, Pg, VectorOffset, ScalarBase) ordering. Since the intrinsic signature is (Mask, Base, Offsets), swapping the scalar base (op2) and vector offsets (op3) in the emission is necessary and correct.


Models Contributing to This Review

This review was synthesized from analysis by multiple models (Gemini 3 Pro, GPT-5.1-Codex) along with direct code inspection. Both models independently flagged the 4-byte GatherVectorNonTemporal codegen issue as a critical bug.

@ylpoonlg
Copy link
Contributor Author

❌ Codegen correctness — GatherVectorNonTemporal may fail for 4-byte elements

These APIs with 4-byte addresses are commented out so they cannot be tested currently, but adding them for completeness.

⚠️ Test coverage — Tests are commented out
⚠️ API consistency — Commented-out overloads citing #103297

All commented tests/APIs should be related to #103297.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area-System.Runtime.Intrinsics community-contribution Indicates that the PR has been added by a community member

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants